{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "s1xszHqdIieh"
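   },
   "source": [
    "# Mistral 7B LM Eval Testing\n",
    "\n",
    "Code to evaluate three variants of Mistral-7B on the Open LLM Leaderboard benchmarks: ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8k. The cells below run `mistralai/Mistral-7B-Instruct-v0.2`; change the `pretrained=` value in `--model_args` to evaluate another variant."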
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "uAsu08nhIzu1",
    "outputId": "484bf888-2f5c-420e-b1dc-32aebb67dc2e"
   },
   "outputs": [],
   "source": [
    "!nvcc --version"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "nUhy3WQIIzD4",
    "outputId": "86692d6e-77a5-4c0c-f14e-88e90ead40b2"
   },
   "outputs": [],
   "source": [
    "!nvidia-smi"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Zfjv9nzQSyZ0",
    "outputId": "503420c0-1cf5-4392-d600-ae3e7f2db00e"
   },
   "outputs": [],
   "source": [
    "!pip install -q -U transformers peft torch accelerate einops sentencepiece bitsandbytes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "8Bw1lhjhYhw1",
    "outputId": "37e4cb70-dcf2-4591-b561-b558dd02b674"
   },
   "outputs": [],
   "source": [
    "# clone repository\n",
    "!git clone https://github.com/EleutherAI/lm-evaluation-harness.git"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "oWagqq3cwLgF",
    "outputId": "bfee561d-8edf-423c-ac06-c33c998bb2c7"
   },
   "outputs": [],
   "source": [
    "# change to repo directory\n",
    "import os\n",
    "\n",
    "os.chdir(\"/content/lm-evaluation-harness\")\n",
    "# install\n",
    "!pip install -e ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "QSp37lvCq65V"
   },
   "outputs": [],
   "source": [
    "import datetime\n",
    "import os\n",
    "\n",
    "# Timestamped folder name so each run writes to its own results directory\n",
    "now = datetime.datetime.now().strftime(\"%Y_%m_%d_%H_%M_%S\")\n",
    "\n",
    "# One subfolder per benchmark; makedirs also creates the parent folder\n",
    "for task in [\"arc\", \"hellaswag\", \"mmlu\", \"truthfulqa\", \"winogrande\", \"gsm8k\"]:\n",
    "    os.makedirs(f\"/content/{now}/{task}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "RieFKmwdq65V"
   },
   "outputs": [],
   "source": [
    "# Expose the folder name to the shell so the lm_eval cells can use $now_log_folder\n",
    "os.environ[\"now_log_folder\"] = now"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "B32ZQxvtIIUt"
   },
   "source": [
    "# ARC Challenge\n",
    "\n",
    "AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "KY5ssp_iRu7O",
    "outputId": "ac50e638-2c9c-4a36-d1d1-8236434cdc6f"
   },
   "outputs": [],
   "source": [
    "!lm_eval --model hf \\\n",
    "    --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,dtype=\"bfloat16\" \\\n",
    "    --tasks arc_challenge \\\n",
    "    --batch_size 16 \\\n",
    "    --write_out \\\n",
    "    --output_path /content/$now_log_folder/arc/arc_challenge_mistralai_Mistral-7B-Instruct-v0.2_lm_eval.json \\\n",
    "    --device cuda:0 \\\n",
    "    --num_fewshot 25 \\\n",
    "    --verbosity DEBUG\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "mLyNCt57IQ7f"
   },
   "source": [
    "# HellaSwag\n",
    "\n",
    "HellaSwag (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "background_save": true,
     "base_uri": "https://localhost:8080/"
    },
    "id": "ylb0Xp_prHCy",
    "outputId": "502ad697-fbd0-4d0b-b2c9-586408831c37"
   },
   "outputs": [],
   "source": [
    "!lm_eval --model hf \\\n",
    "    --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,dtype=\"bfloat16\" \\\n",
    "    --tasks hellaswag \\\n",
    "    --batch_size 16 \\\n",
    "    --write_out \\\n",
    "    --output_path /content/$now_log_folder/hellaswag/hellaswag_mistralai_Mistral-7B-Instruct-v0.2_lm_eval.json \\\n",
    "    --device cuda:0 \\\n",
    "    --num_fewshot 10 \\\n",
    "    --verbosity DEBUG\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "h03ywoXsIdM6"
   },
   "source": [
    "# MMLU\n",
    "\n",
    "MMLU (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "UdFcxfa9rQ97",
    "outputId": "f379128c-6db8-451c-c3ce-0bc61fac29d6"
   },
   "outputs": [],
   "source": [
    "!lm_eval --model hf \\\n",
    "    --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,dtype=\"bfloat16\" \\\n",
    "    --tasks mmlu \\\n",
    "    --batch_size 4 \\\n",
    "    --write_out \\\n",
    "    --output_path /content/$now_log_folder/mmlu/mmlu_mistralai_Mistral-7B-Instruct-v0.2_lm_eval.json \\\n",
    "    --device cuda:0 \\\n",
    "    --num_fewshot 5 \\\n",
    "    --verbosity DEBUG\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yMZrYPKOLrZz"
   },
   "source": [
    "# TruthfulQA\n",
    "\n",
    "TruthfulQA (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA in the harness is at minimum a 6-shot task, because 6 examples are always prepended to the prompt, even when the number of few-shot examples is set to 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "background_save": true
    },
    "id": "SzSg6FlKrjZk",
    "outputId": "af1793e1-4d9c-49c6-b03c-2df8615495f1"
   },
   "outputs": [],
   "source": [
    "!lm_eval --model hf \\\n",
    "    --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,dtype=\"bfloat16\" \\\n",
    "    --tasks truthfulqa_mc2 \\\n",
    "    --batch_size 16 \\\n",
    "    --write_out \\\n",
    "    --output_path /content/$now_log_folder/truthfulqa/truthfulqa_mc2_mistralai_Mistral-7B-Instruct-v0.2_lm_eval.json \\\n",
    "    --device cuda:0 \\\n",
    "    --num_fewshot 0 \\\n",
    "    --verbosity DEBUG\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZzeqfQaqLrff"
   },
   "source": [
    "# Winogrande\n",
    "\n",
    "Winogrande (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "background_save": true
    },
    "id": "LduQR_dKr31q",
    "outputId": "ead6adf4-fb31-47ec-d49f-cce81fdf9d18"
   },
   "outputs": [],
   "source": [
    "!lm_eval --model hf \\\n",
    "    --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,dtype=\"bfloat16\" \\\n",
    "    --tasks winogrande \\\n",
    "    --batch_size 16 \\\n",
    "    --write_out \\\n",
    "    --output_path /content/$now_log_folder/winogrande/winogrande_mistralai_Mistral-7B-Instruct-v0.2_lm_eval.json \\\n",
    "    --device cuda:0 \\\n",
    "    --num_fewshot 5 \\\n",
    "    --verbosity DEBUG\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yLWIv0jJLynA"
   },
   "source": [
    "# GSM8k\n",
    "\n",
    "GSM8k (5-shot) - diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "hVWrx61vrk_I",
    "outputId": "d1388f66-12ae-414a-f54a-0898ff9747a2"
   },
   "outputs": [],
   "source": [
    "!lm_eval --model hf \\\n",
    "    --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,dtype=\"bfloat16\" \\\n",
    "    --tasks gsm8k \\\n",
    "    --batch_size 16 \\\n",
    "    --write_out \\\n",
    "    --output_path /content/$now_log_folder/gsm8k/gsm8k_mistralai_Mistral-7B-Instruct-v0.2_lm_eval.json \\\n",
    "    --device cuda:0 \\\n",
    "    --num_fewshot 5 \\\n",
    "    --verbosity DEBUG\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "t66XpVTIzuHt"
   },
   "source": [
    "# Zip Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "background_save": true
    },
    "id": "KjVi9CKBzw50",
    "outputId": "671a7574-9694-439d-bbc6-b915a42f10f5"
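   },
   "outputs": [],
   "source": [
    "# Archive the timestamped results folder (the archive name is an assumption; adjust as needed)\n",
    "!zip -r /content/$now_log_folder.zip /content/$now_log_folder"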
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9ZrKoTc0rlIP"
   },
   "source": [
    "# Task List"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "background_save": true
    },
    "id": "onoPU1Cmk5EF",
    "outputId": "f9d80032-7b32-4a5f-9dd4-6a5041efb88b"
   },
   "outputs": [],
   "source": [
    "!lm_eval --tasks list"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "A100",
   "machine_shape": "hm",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
